Split Zone Analytics - The Book

Sam Assaf

404-680-6269

samassaf7@gmail.com

Outline

Abstract & Example : Analysts & Coaches

In Week 14 of the 2023 NFL season, the Pittsburgh Steelers hosted the New England Patriots on Thursday Night Football. That evening, numerous 4th-down calls were questionable from an analytics perspective. This report is an in-depth explanation of the process used to create “The Book,” an analytics-based advisory tool for football coaches to consult when it is unclear what play to run on 4th down given the game situation. To demonstrate how this information would be used in practice, a specific play in the Steelers v. Patriots game is highlighted, and the model (henceforth the “Split Zone Model”) suggests the play type that yields the highest increase in the probability of winning. The offensive coach can then take this into account when calling the play.

The Example: With 5:06 left in the 4th quarter, leading by 3 points, Steelers head coach Mike Tomlin faced a decision on 4th and 3 with the ball on his own 38 yard line (62 yards from the scoring end zone). After some deliberation, Coach Tomlin lined up in punt formation, likely intending to punt, but the long snapper was called for a false start, resulting in a 5 yard penalty and a 4th and 8 from the 33 yard line. Prior to the penalty, “The Book” strongly advised going for it in this situation. This was an ideal scenario for a coach to consult “The Book” for quick, efficient analytical advice. Although the Split Zone Model’s predictions stop at 60 yards from the offense’s scoring end zone (a team’s own 40 yard line), it is possible to extrapolate beyond those explicit boundaries, with the understanding that precision is lost the further one strays from the confines of the model. After the additional 5 yard penalty, however, the situation may be too far outside the model’s bounds for it to offer helpful insight. Figure 1 provides a visual representation that, through colorized boxes, gives a coach instant visual cues to guide decision making. The figure uses a common analytical metric, Win Probability Added (WPA), defined as the percent change in a team’s chances of winning from one play to the next. In the figure below, the color represents the difference between the WPA of running/passing (i.e., going for it) and the WPA of punting/kicking a field goal (i.e., not going for it).
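The color scale in Figure 1 boils down to one number per cell: the gap between the best “go” WPA and the best “no-go” WPA. A minimal sketch of that comparison, using hypothetical predicted WPA values rather than the Split Zone Model’s actual outputs:

```python
def go_vs_no_go(wpa_by_play_type):
    """WPA edge of going for it over kicking or punting.

    Positive values (green in Figure 1) favor going for it;
    negative values (blue) favor a punt or field goal.
    """
    go = max(wpa_by_play_type["pass"], wpa_by_play_type["run"])
    no_go = max(wpa_by_play_type["punt"], wpa_by_play_type["field_goal"])
    return go - no_go

# Hypothetical 4th-and-3 predictions (illustrative numbers only)
preds = {"pass": 0.021, "run": 0.018, "punt": -0.004, "field_goal": -0.030}
edge = go_vs_no_go(preds)  # positive, so "go for it"
```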

Figure 1: This figure shows the difference between the two predicted win probability added (WPA) values associated with passing or running the ball (i.e., going for it) and kicking a field goal or punting (i.e., not going for it). The brighter the green, the more strongly the Split Zone Model recommends going for it; the darker the blue, the more strongly it recommends not going for it. As the color approaches white, the guidance offered is less decisive.

Given the evidence presented in Figure 1, which incorporates variables (described below) specific to this game, such as both teams’ strengths and weaknesses and the game situation described above, the model fairly strongly advises going for it on 4th and 3. Not only does a pass or run (going for it) have a higher WPA value, but when a base risk aversion adjustment and a punt risk aversion adjustment (both described below) are incorporated, there is a statistically significant difference between going for it and not going for it, lending confidence to the decision.

When looking at the distribution of historical play types in situations similar to the one described (see Figure 2), it is evident how risk averse coaches are, and how guidance from the Split Zone Model could help them make the decision that offers the highest increase in win probability.

Figure 2: Historical counts of play type chosen when faced with a 4th and 3 between the 46 and 60 yard lines

In this report, some sections are identified as beneficial for both analysts and coaches, while others will likely only be useful for analysts. Nonetheless, anyone with basic statistics knowledge and minimal guidance should be able to follow along and understand the findings of this report.

Introduction : Analysts & Coaches

The goal of this project is to provide NFL coaches with a binder of game situation sheets, The Book, that they can quickly consult during a game. The Book, created using a proprietary algorithm called the Split Zone Model, provides simple visualizations that are easy to read and process in real-world, high pressure situations. The Book is unique to each team and opponent, and is consulted when a coach faces an uncertain situation, meaning it is unclear what the next play call or play type should be. One can watch virtually any football game and hear announcers make comments like, “the analytics say you should go for it here.” This is all the more intriguing considering there is no obvious way to know what analysis lies behind such commentary.

Relatedly, in collegiate football, many teams, including the University of Notre Dame, use the “Game Book,” an analytics-based binder developed by Championship Analytics, for each game. The “Game Book” recommends whether to go for it, kick a field goal, or punt based on the time remaining, score differential, field position, and yards needed for a first down. It is unclear in exactly which ways Championship Analytics’ Game Book differs from The Book, since it is impossible to know the specific variables that similar products use to create their analytical books; what is known is how their visuals are displayed and how the game situation is described.

It can be argued that there has been a change in how football games are played that correlates with this rising awareness of analytics. The plot in Figure 3, from Ben Baldwin, a well-known football analytics expert, illustrates the increase in the frequency with which head coaches went for it when the analytics said they should over the 2014-2020 timeframe. Although one should be careful not to infer causation from correlation alone, it is reasonable to believe that awareness of analytics accounts for the difference in coaching decisions.

Figure 3: Difference in “go for it” decisions from 2014 to 2020. (Ben Baldwin)

As a Division 1 collegiate football student-athlete, I have had the opportunity to see firsthand how advances in analytics have changed the game of football, and this has served as the main motivation for this project. My perspective as a football player and sports analytics aficionado provides me with unique insights into the implementation of analytics, but it also forces me to recognize the human element in the game. Football is played by players, and no matter what the analytics might advise a coach to do, it is the player who executes the play and ultimately makes it succeed or fail. That said, the use of in-game analytics is undoubtedly a growing trend, and coaches are starting to seriously weigh analytical advice when approaching a tough decision, specifically whether or not to go for the first down.

Problem Framing : Analysts

As previously noted, the broad goal of this work is to offer advice to a head coach and/or offensive coordinator (i.e., whoever the decision-maker is) at a moment’s notice. Practically speaking, there is often an analyst or assistant coach who physically holds the printed pages – usually in a large binder, since computing devices of any kind are banned on NFL sidelines and in the booths – containing analytics visualizations that predict the most beneficial play type to call in the current game situation. This work specifically addresses 4th down decisions, with the goal of optimizing win expectancy by inspecting all reasonable play types. To make this possible at a moment’s notice, the model is limited mainly to inputs that are calculated and determined before the game. This allows the model to account for the strengths and weaknesses of the team of interest and their opponent, while producing predictions that vary with the time left in the game, yard line, score differential, and play type.

This research specifically focuses on the analytics surrounding play calling in the NFL. The decision to concentrate on the NFL is in part due to the relative stability of player rosters, particularly among impact players, compared to the collegiate level. This choice is further motivated by factors such as the cleanliness and accessibility of NFL data. It is worth noting, however, that a similar model could be implemented for college football teams, contingent upon data availability.

Data Overview : Analysts

The entirety of the data used for training, testing, and implementation in this project came from nflfastR. The Split Zone Model was created using play-by-play data from 2010 through 2022 for all NFL teams. Notably, the available data included not only standard play-by-play data but also the weather (temperature, wind, and conditions, but not humidity), the stadium where each game was played, and the surface type.

Before training and modeling, some data preparation was required. For simplicity, the more typical data cleaning steps will not be addressed; the focus here is instead on a few steps of data wrangling and feature creation that are specific to this project.

The Elo and EPA (Expected Points Added) statistics are used in the model to capture teams’ strengths and weaknesses, ranging from how well a team has been playing overall to how its run offense or defense has been performing. To calculate the Elo statistics (a measure of a team’s current strength) and certain EPA statistics, the week 1 observations were removed from the training and test data sets, since no historical data from the current season existed yet. That said, if week 1 guidance is needed, predicted Elo and EPA statistics, or average Elo and EPA statistics from the prior season, can be used so that “The Book” can still be created for a week 1 game. For the Elo statistics, a relatively high k factor (which governs how much the ratings change after each game) was chosen due to the limited number of games in an NFL season.
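For readers unfamiliar with Elo, the update rule is simple enough to sketch. The k factor of 40 below is an illustrative choice, not necessarily the value used in the Split Zone Model:

```python
def elo_update(rating_a, rating_b, score_a, k=40):
    """Update two teams' Elo ratings after one game.

    score_a is 1 if team A won, 0.5 for a tie, 0 for a loss.
    The k factor (40 here, an illustrative value) governs how much
    a single result moves the ratings; a higher k suits leagues,
    like the NFL, with few games per season.
    """
    # Expected score for A from the standard logistic Elo curve
    expected_a = 1 / (1 + 10 ** ((rating_b - rating_a) / 400))
    new_a = rating_a + k * (score_a - expected_a)
    new_b = rating_b + k * ((1 - score_a) - (1 - expected_a))
    return new_a, new_b
```

For example, if two evenly rated 1500 teams play and team A wins, A gains k/2 = 20 points and B loses 20.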

As noted, weather is used as a variable in the Split Zone Model, but some observations were missing these values. To preserve observations, the missing weather statistics for those games were imputed with the mean of the respective measurement at the respective stadium; no specific “weather detail” conditions were artificially imposed on any game. For domes or closed stadiums missing this information, conditions were assumed to be 70 degrees with 0 mph wind.
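A minimal sketch of this imputation scheme, assuming a simple list-of-dicts representation of the games (the project itself works with nflfastR data in R, so the structure here is illustrative):

```python
from statistics import mean

def impute_weather(games):
    """Fill missing temperature/wind with the stadium's historical mean.

    `games` is a list of dicts with keys 'stadium', 'roof', 'temp',
    'wind'; missing values are None. Domes/closed stadiums missing
    weather are assumed to be 70 degrees with 0 mph wind, as in the text.
    """
    for field, dome_default in (("temp", 70.0), ("wind", 0.0)):
        # Collect observed values per stadium
        by_stadium = {}
        for g in games:
            if g[field] is not None:
                by_stadium.setdefault(g["stadium"], []).append(g[field])
        # Impute the missing ones
        for g in games:
            if g[field] is None:
                if g.get("roof") in ("dome", "closed"):
                    g[field] = dome_default
                else:
                    g[field] = mean(by_stadium[g["stadium"]])
    return games
```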

One contribution of the Split Zone Model is that it clusters (bins) continuous variables into categorical ranges; this was done for three reasons. First, it is imperative to give coaches a reasonable number of sheets in “The Book” to consult during a game; limiting how finely the in-game variables can vary by binning them was necessary to limit the number of permutations of advising sheets. Second, the information presented on the sheets is simplified (there are far fewer data points to consider), making it less cognitively demanding and resulting in easier, quicker decisions for coaches. Finally, binning allows the machine learning model to split the data along lines informed by typical football knowledge. For example, competing models typically use continuous variables such as the time left in a game. Measured in seconds, this variable would give the model 3,600 different potential values to split on. Using basic football knowledge and the distribution of WPA by seconds left in the game, bins were chosen manually for this measure (bins are listed in Table 1). The same was done for the score differential (relative to the possession team), with bins chosen as follows: a field goal difference or less, a touchdown (with two-point conversion) or less, a two possession game, and more than a two possession game. The last binned variable was yard line, based on an underlying understanding of what analytics initially predicted and a break at the 38 yard line, which was identified as the end of field goal range.
Note that these yard line bins can easily be altered to fit a coach’s specific play calling tendencies, based on yards from the scoring end zone and what they identify as their kicker’s range.
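The binning itself is straightforward; below is a sketch using the yard line bins from Table 1 (the helper name `bin_value` is illustrative, not from the project’s code):

```python
def bin_value(x, bins):
    """Map a continuous value to the label of the first bin containing it.

    `bins` is a list of (low, high, label) tuples with inclusive
    endpoints, mirroring the manually chosen bins described above.
    """
    for low, high, label in bins:
        if low <= x <= high:
            return label
    raise ValueError(f"{x} falls outside all bins")

# Yard line bins from Table 1 (yards from the scoring end zone)
YARDLINE_BINS = [(1, 5, "1-5"), (6, 15, "6-15"), (16, 25, "16-25"),
                 (26, 38, "26-38"), (39, 45, "39-45"), (46, 60, "46-60")]
```

For example, `bin_value(38, YARDLINE_BINS)` lands in the "26-38" bin, the edge of field goal range.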

To allow the model to better utilize general football knowledge that might otherwise be overlooked, yard line bins were created. The visualization is presented to show that these yard line bins do in fact capture the general movement of the average WPA (for 4th downs inside 60 yards only) when grouped by individual yard line.

As a reminder, the data was filtered to 4th downs only. This choice was supported by training one model on 1st through 4th downs and another only on 4th downs: the two models missed predictions in similar fashion, which, combined with the general principle of training a model on the task it will be used for, justified the filter. Following the same logic, the data was filtered to plays starting 60 yards or less (a team’s own 40 yard line) from the scoring end zone. This cutoff was chosen with the knowledge that coaches rarely go for it from behind their own 40 yard line. In fact, looking at all 4th downs inside a team’s own 40 in the data set from 2010 to 2022, when the score was within two possessions, teams went for it less than 5% of the time. When the last two minutes of each game are also filtered out, teams went for it about 3% of the time.

When comparing models trained on the whole data set against those trained only on plays 60 or fewer yards from the scoring end zone, there was no difference in how they missed. It is worth noting that this filtering reduces the data set from roughly 480,000 plays to about 29,000, but since the models showed no difference when tested on 4th downs, keeping the larger data set was not deemed necessary. In other words, observations of plays that were not 4th downs within 60 yards of the scoring end zone provided no additional predictive information to the model.

Variables in Model : Analysts & Coaches

In general, the data set for training contained:

Outcome Variable:
  • Win Probability Added (WPA)

Game Data:
  • Stadium
  • Roof
  • Surface
  • Temperature
  • Wind
  • Weather detail

Play Data:
  • Play type (input)
  • Field goal probability
  • Yards to go (first down or score)
  • Yardline (continuous and binned)1
  • EPAs2
  • ELOs2
  • Score differential (binned)3
  • Game time remaining (binned)4
  • Score difference time ratio5

1 Continuous: 1-60; Bins: [1, 5], [6, 15], [16, 25], [26, 38], [39, 45], [46, 60]
2 Various aggregated and lagged measures, for both teams’ offensive pass & run game and defensive pass & run game
3 Bins: (-∞, -17], [-16, -9], [-8, -4], [-3, 0], [1, 3], [4, 8], [9, 16], [17, ∞)
4 First Quarter: [15, 0]; Second Quarter: [15, 4), [4, 2), [2, 0]; Third Quarter: [15, 0]; Fourth Quarter: [15, 7), [7, 4), [4, 2), [2, 0]
5 Calculated using the average time remaining for each time bin and the most frequently occurring score differential for each bin.

Table 1.

Data Visualization : Analysts & Coaches

Figure 4: This plot shows the distribution of WPA by play type and time remaining. The WPA for a field goal attempt is relatively constant, while the WPA for a punt decreases as time goes on. This is likely because, the closer to the end of the game, the more likely a team is to be losing, and therefore the more valuable it is to go for it on 4th down.

Figure 5: This plot supports the expected relationship between a team’s pregame ELO rating and their lagged pregame run EPA. As expected, there is a clear positive relationship between the two. This visualization is intended to show that ELO is a fair metric for assessing a team’s strengths and weaknesses, assuming that EPA is a valid statistic for assessing a team as well.

These visualizations do not cover the full extent of the data, but they give valuable insight into the type of data contained in the data set and what insights and outcomes might be gathered from analyzing it and building a model.

Methods : Analysts

The final data set consists of columns that are either numeric or boolean; it contains over 29,000 rows and 93 predictor columns. The data is split into training and test sets containing 80% and 20% of the rows, respectively. The final prediction heat map (see Figure 7) is built from a matrix of predicted win probability added values by play type option and game situation, generated with regression modeling. Specifically, this was done with XGBoost, a gradient boosting algorithm. In simple terms, it works by creating regression trees that split on variable values to minimize prediction error. At the end of the modeling, various plots could be inspected to see which features were most impactful.
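XGBoost itself is a large library, but the core mechanics of gradient boosting for regression can be illustrated with a toy from-scratch version that uses depth-1 trees (stumps) and squared error. This is a simplified sketch of the technique, not the Split Zone Model’s actual implementation:

```python
def fit_stump(x, residuals):
    """Find the split threshold on x that minimizes squared error."""
    best = None
    for threshold in sorted(set(x)):
        left = [r for xi, r in zip(x, residuals) if xi <= threshold]
        right = [r for xi, r in zip(x, residuals) if xi > threshold]
        if not left or not right:
            continue
        lmean, rmean = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((r - lmean) ** 2 for r in left)
               + sum((r - rmean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, lmean, rmean)
    _, t, lmean, rmean = best
    return lambda xi: lmean if xi <= t else rmean

def predict(stumps, base, eta, xi):
    """Ensemble prediction: base rate plus eta-scaled stump outputs."""
    return base + eta * sum(s(xi) for s in stumps)

def boost(x, y, n_trees=50, eta=0.05):
    """Each new stump fits the residuals the ensemble has not yet
    explained; a smaller eta learns more slowly but more carefully."""
    base = sum(y) / len(y)
    stumps = []
    for _ in range(n_trees):
        preds = [predict(stumps, base, eta, xi) for xi in x]
        residuals = [yi - pi for yi, pi in zip(y, preds)]
        stumps.append(fit_stump(x, residuals))
    return base, stumps
```

With eta = 0.05, each tree corrects only 5% of the remaining error, which is why more trees are needed but more subtle patterns can survive the averaging.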

Figure 6: The 15 most impactful variables in the final model, colored by feature value. For quantitative measures such as ELO, low (yellow) is 1000-1300, middle (pink) is 1300-1600, and high (purple) is 1600-1900. Qualitative measures, like the binned game time, are either true (1) or false (0) for the respective bin. Each point represents an observation, and its position along the x-axis (SHAP value) represents how much it impacted the prediction. Many of these top 15 variables were expected, and it was very encouraging that many of the manually created bins were among them. This is a result of creating the bins by looking at the distributions of the continuous variables and using football knowledge to nudge the machine learning model in the correct direction.

The hyperparameters of XGBoost were tuned to help the algorithm avoid overfitting while also minimizing error. In practice, however, it is not always optimal to choose the configuration with the lowest RMSE: this model does not predict WPA in isolation, but is used with a matrix of all possible game combinations and play types, showing the differences in WPA while holding the game situation constant and changing the play type. Although most hyperparameters were relatively straightforward, two models were built with etas (learning rates) of 0.1 and 0.05, and their performance was compared. The final model uses an eta of 0.05, allowing it to capture relatively complex patterns observed in football that might be missed with an eta of 0.1.

In addition to choosing hyperparameters, another impactful choice was which variables to include in the final model. Throughout testing, the number of columns changed slightly while the number of rows remained consistent. This study was limited to variables that change during the game (game situation variables), and to keep “The Book” to a manageable number of pages, the binned categorical variables described in the Data Overview were created and used. Using these binned variables, and combinations of them, positively impacts the model’s ability to produce reasonable and insightful advice. While it was clearly necessary to convert game time remaining in seconds to a binned categorical variable, the decision was less clear for yard line, a coarser continuous variable, and for score differential, which could be used in either continuous or categorical form. After comparing all combinations of continuous and categorical inputs, it was ultimately decided to include both forms for yard line (yards to the scoring end zone) but only the categorical forms of score difference and game time remaining.

To compare different models when standard measures of performance are not as applicable, multiverse analysis can be used. Simply put, four copies of each row in the test data set were created, with only the input play type changed in each copy. Predictions were then made for each row, making comparisons between input play types evident; the best model was selected by comparing these predictions against what could be considered “reasonable.” Throughout this testing, the decision was made to help the model identify complex football patterns through the created categorical variables, in order to produce more logical and reasonable predictions (in the football sense) and give the model the best chance of usefully advising a coach. For example, this study operates under the premise that a coach would never punt when in field goal range (line of scrimmage at or inside the 38 yard line) and would never kick a field goal when the line of scrimmage is past the 38 yard line (a field goal longer than 55 yards); such observations were also filtered out of the training and test data sets. The rationale for this decision is presented in the plot below (see Figure 8), which shows the count of field goals taken by line of scrimmage on the x-axis, colored by the predicted field goal probability, which is based on a large sample of historical field goals from that yard line. This cutoff could be altered based on the kicker, time of season, weather, and any other relevant variables.
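A sketch of the multiverse expansion and the field-goal-range filter described above, assuming a list-of-dicts test set (the function names are illustrative, not from the project’s code):

```python
PLAY_TYPES = ("pass", "run", "punt", "field_goal")

def expand_play_types(rows):
    """Create four copies of each test row, one per input play type,
    so predictions can be compared while holding the situation fixed."""
    expanded = []
    for row in rows:
        for play_type in PLAY_TYPES:
            copy = dict(row)
            copy["play_type"] = play_type
            expanded.append(copy)
    return expanded

def plausible(row):
    """Drop combinations the study rules out: no punts from inside
    field goal range (at or inside the 38 yard line) and no field
    goals from beyond it (longer than 55 yards)."""
    if row["play_type"] == "punt" and row["yardline"] <= 38:
        return False
    if row["play_type"] == "field_goal" and row["yardline"] > 38:
        return False
    return True
```

Running a model over the expanded rows and grouping predictions by the original row makes the play-type comparisons immediate.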

The final model was identified after further training and testing. It is easy to imagine how the predictions for a game would be made, but actually generating them is a unique challenge. Ultimately, a data set was created containing every combination of continuous and categorical yard line, categorical time, categorical score difference, yards to go, and play type. After generating all possible combinations of those variables, the data that depends on them (e.g., categorical yard line and field goal probability) was merged into the data set, along with the pregame measures and information inputs.
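Building that prediction data set amounts to a Cartesian product over the situation variables. A sketch with placeholder bin labels standing in for those in Table 1 (and only a subset of the time and score bins, for brevity):

```python
from itertools import product

# Placeholder bin labels; the real ones are listed in Table 1
YARDLINE_LABELS = ["1-5", "6-15", "16-25", "26-38", "39-45", "46-60"]
TIME_LABELS = ["Q4 15-7", "Q4 7-4", "Q4 4-2", "Q4 2-0"]   # subset
SCORE_LABELS = ["[-3,0]", "[1,3]"]                        # subset
PLAY_TYPES = ["pass", "run", "punt", "field_goal"]
YARDS_TO_GO = range(1, 11)

def prediction_grid():
    """Every combination of game situation and play type; data that
    depends on these (e.g., field goal probability by yard line bin)
    and the pregame measures would be merged in afterwards."""
    keys = ("yardline_bin", "time_bin", "score_bin",
            "play_type", "yards_to_go")
    return [dict(zip(keys, combo))
            for combo in product(YARDLINE_LABELS, TIME_LABELS,
                                 SCORE_LABELS, PLAY_TYPES, YARDS_TO_GO)]
```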

Model Adjustments : Analysts & Coaches

When comparing models and inspecting their predicted play types, ordered from best to worst WPA prediction, some of the orderings raised concern for anyone with a base level of football knowledge. The concern grew when the entire data set was used to make every possible prediction: the model rarely recommended kicking a field goal, and even more rarely recommended punting as the best decision. This was extremely demoralizing, prompting concern that the model and the work were worthless. Shortly after these results were generated, however, it was confirmed that in reality “the analytics” rarely say to kick a field goal or punt. After discussion and research, it was decided that best practice would be to add a risk adjustment parameter shifting the WPA predictions for field goals and punts (the “not go for it” decisions) to make them comparable to pass and run WPA predictions. This was implemented as three adjustments. The first is a base adjustment to the “not go for it” option: the field goal WPA prediction if the yard line is 38 yards or less (a 55 yard field goal or shorter), and the punt WPA prediction beyond that. The second is a punt-specific risk aversion that gives punt play types an additional bump. The third is a coach-chosen risk aversion factor, not context specific, ranging from 0 (no risk aversion) to 1 (very risk averse). All of these risk aversion values can be altered for specific games.

Examples of predictions with and without the risk adjustments are presented below (see Figures 9a and 9b, without and with risk adjustments, respectively). For simplicity, the examples use the same teams as above but a slightly different situation. To create a scenario with a less obvious outcome, consider the 4th quarter with 15 to 7 minutes left, with the Steelers losing by 4 to 8 points (a one possession game).

Model Without Risk Aversion Adjustments : Analysts & Coaches

Above are the model’s predictions without any risk adjustments (the field goal and punt base risk aversion adjustment, the punt-specific risk aversion adjustment, and the coach-specific risk aversion adjustment). The model clearly favors passing or running (going for it) over a field goal or punt (not going for it) in almost every scenario; in fact, the model tells the coach to “go for it” in every scenario when the team needs 10 yards or fewer for a first down or a score. While this is what the raw, unadjusted analytics advise, in actuality, a coach presented with this would almost certainly disregard any other advice generated by the model, likely because coaching is an inherently risk averse profession. It is for this reason that adjustable risk aversion factors were necessary.

Model With Risk Aversion Adjustments : Analysts & Coaches

After implementing the risk adjustment factors, those who know football will see a more “reasonable” prediction that may still err on the side of being too risky, but would not cause a coach to question the validity of the model. Note that this model predicts that the team should “go for it” virtually any time there are 3 or fewer yards to go.

In more detail, the base adjustment was set after considering how the average predictions for field goal and punt options compared to the pass and run options; it is applied to all field goal and punt options. The second adjustment is to the punt WPA predictions: the raw model is very anti-punt, which is quite unrealistic for any NFL coach for the foreseeable future. For this reason, a punt risk adjustment was added to all punt play type WPA predictions, shifting them up to be slightly more comparable to the run and pass WPA predictions. The last adjustment is a coach risk aversion adjustment, chosen by the coach on a scale of 0 (risky) to 1 (not risky); the default coach adjustment is 0.5.
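One plausible way to combine the three adjustments is sketched below; the adjustment magnitudes are placeholders for illustration, not the fitted values used in the Split Zone Model:

```python
def adjust_wpa(pred, play_type,
               base_adj=0.01, punt_adj=0.01, coach_risk=0.5, coach_scale=0.02):
    """Shift a raw WPA prediction for the "not go for it" play types.

    base_adj is the base bump for kicking options; punt_adj is the
    extra punt-specific bump; coach_risk is the coach-chosen aversion
    in [0, 1] (default 0.5), scaled by coach_scale. All magnitudes
    here are placeholders, not the fitted adjustments.
    """
    if play_type not in ("punt", "field_goal"):
        return pred  # pass/run ("go for it") predictions are untouched
    bump = base_adj              # base "not go for it" adjustment
    if play_type == "punt":
        bump += punt_adj         # extra punt-specific aversion
    bump += coach_risk * coach_scale  # coach-chosen aversion
    return pred + bump
```

With these placeholder values, a field goal prediction of -0.03 becomes -0.01, narrowing the gap to the go-for-it options without erasing it.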

Although it might seem counterintuitive to dampen the differences in the model’s predictions, coaches are inherently risk averse, and the model still displays differences in play type prediction distributions. After testing multiple base and punt risk adjustment factors, these adjustments seemed appropriate: they put all the play types on a more level playing field while still allowing the model to point out opportunities to potentially “go for it” or not.

Discussion : Analysts & Coaches

The results presented in this project are intriguing. Using this research, one can choose any upcoming NFL game, pull its data from nflfastR, gather all the information needed, and calculate the statistics for the model. Once that is complete, plugging the respective information into the appropriate places in the R script generates 72 plots, one for each combination of categorical time left and categorical score difference. Additionally, a coach or staffer could choose what type of plot they would like to see in the “heat of the moment,” while also specifying their expected yard line cutoff for field goals. The number of bins (and how many observations each contains) could also be altered: the coach could adjust how wide the yard line, time, and score differential bins stretch based on their coaching preferences and style. Finally, the coach would choose an appropriate coach risk aversion adjustment parameter, which could change weekly.


Bibliography and References

NFL 4th Down Bot

  • https://www.nfl4th.com/articles/4th-down-research.html#nfl-coaches-are-increasingly-adhering-to-nfl4th-recommendations

Similar Paper

  • https://cs229.stanford.edu/proj2016/report/LeeChenLakshman-PredictingOffensivePlayTypesIntheNFL-report.pdf

ELO

  • https://en.wikipedia.org/wiki/Elo_rating_system

Hyperparameter tuning

  • Maximum depth - controls how many splits each tree can make before it stops (a lower value is more conservative and less likely to overfit, and vice versa)
  • Minimum child weight - decides how much weight a node needs in order to allow a further split (a higher value is more conservative and less likely to overfit, and vice versa)
  • Gamma - controls the amount of loss reduction needed to make a split
  • Subsample - proportion of the training data rows used to build each tree (lets different observations of the training data be focused on)
  • Column sample - proportion of the training data columns used to build each tree (allows different features to shine)
  • Eta - the rate at which the model learns (learning too fast can miss patterns, while learning too slowly can fail to capture them)
  • Number of trees - the number of trees allowed in the model, where each tree tries to correct the previous trees’ mistakes (too many trees can overcomplicate the relationships and overfit the data, and vice versa)

Multiverse Analysis

  • https://pubmed.ncbi.nlm.nih.gov/27694465/

nflfastR

  • https://github.com/nflverse/nflfastR